Automated recognition of emotional states of horses from facial expressions

Animal affective computing is an emerging field which has so far focused mainly on pain, while other emotional states remain largely uncharted, especially in horses. This study is the first to develop AI models that automatically recognize horse emotional states from facial expressions, using data collected in a controlled experiment. We explore two types of pipelines: a deep learning one that takes video footage as input, and a machine learning one that takes EquiFACS annotations as input. The former outperforms the latter, reaching 76% accuracy in separating four emotional states: baseline, positive anticipation, disappointment and frustration. Anticipation and frustration were difficult to separate, with only 61% accuracy.


Introduction
The prevailing consensus now acknowledges that animals experience not only negative emotions such as fear and distress [1], but also positive emotional states [2]. While the historical focus of animal welfare science centered on pain and suffering, a recent notable shift in perspective encompasses a broader evaluation of animals' overall quality of life [3]. This shift has also led to increased interest in animal emotion research, and specifically in positive emotional states [4,5].
Facial expressions are an important information channel for affective states in animals. Charles Darwin famously expounded upon how facial expressions serve as manifestations of emotional states in both humans and diverse non-human species [6]; however, while he proposed a commonality across species for a given emotion, this proposal has recently been challenged. Thus, while mammals are known to produce facial expressions [7], the mechanistic rules governing these in relation to internal emotional states may vary between species. Consequently, facial expressions and their variability between species are receiving increased interest as potential indicators of internal states in animal emotion and welfare research.
The gold standard for the objective evaluation of the dynamics of facial expressions within human emotion research is the Facial Action Coding System (FACS) [8,9]. FACS has recently been adapted for different non-human species, including several non-human primates (e.g. orangutans [10], chimpanzees [11], macaques [12,13]), marmosets [14], dogs [15], cats [16] and horses [17]. The latter are of particular interest due to their lateral eye placement accompanied by face elongation.
Horses are an understudied species in the context of emotion research. Being highly social animals, they have complex social frameworks [18,19]. They have well-developed communication through nuanced visual cues, including subtle shifts in eye direction, ear positioning, and facial expressions [17,20]. Wathan et al. [21] presented evidence for their ability to distinguish between distinct facial expressions when presented with images of fellow horses, such as those conveying aggressive, positively attentive, or relaxed states.
As with many other species, facial expressions in horses have so far been investigated mainly in the context of pain. A number of grimace scales for assessing pain in horses have been developed, such as the Horse Grimace Scale (HGS) [22], the Equine Pain Face [23], the Equine Utrecht University Scale for Facial Assessment of Pain (EQUUS-FAP) [24] and the FEReq instrument for ridden horses [25]. Merkies et al. [26] studied eye blink and eyelid twitches in relation to negative affective states. Hintze et al. [27] further focused on the eye area, addressing eye wrinkles as a potential tool to evaluate emotional valence in horses, testing horses in different situations, two of them being food anticipation (positive emotional valence) and food competition inducing frustration (negative emotional valence). An important feature for separating these positive and negative situations was found to be the change in the angle between the highest wrinkle and the line through the eyeball. Ears are another important facial part studied in the context of affective states in horses, found to be of importance in fear [28] and vigilance [29], as well as pain [22].
Ricci-Bonot and Mills [30] studied facial expressions in horses in a controlled experiment, testing n = 30 horses across three situations involving the potential availability of food: one positive situation, anticipation of a reward, and two negative situations, frustration at waiting for a reward and disappointment at the loss of the reward. Horse facial expressions were coded using the EquiFACS coding system. While the study could not identify facial markers to differentiate anticipation, significant differences were found in the occurrence of nine action units and behaviors between the two negative situations. The action units 'eye white increase' (AD1), 'ear rotator' (EAD104), and 'biting feeder' were more likely in the frustration phase, while 'blink' (AU145), 'nostril lift' (AUH13), 'tongue show' (AD19), 'chewing' (AD81) and 'licking feeder' were more likely in the disappointment phase.
Manual behavior analysis methods have many limitations, such as being prone to bias and error [31], as well as requiring rater agreement studies and extensive human training. Computer vision based approaches provide an attractive alternative. Broome et al. [32] provide a comprehensive review of state-of-the-art approaches of this type in the context of affect recognition in animals.
As already indicated, the majority of these works focus on pain recognition, addressing species including rodents [33-35], sheep [36], and cats [37]. Several works have addressed automation of pain recognition in horses [38-40]. Lencioni et al. [38] presented a model, based on a Convolutional Neural Network (CNN), with an overall accuracy of 75.8% when classifying pain on three levels: not present, moderately present, and obviously present. When classifying between two categories (pain not present and pain present), the overall accuracy reached 88.3%. However, the validation method used did not leave one animal out, which is the gold standard in this context (see [32] for a discussion); had it been used, performance may have been lower. This work used only single frames as input. Another study focusing on horse facial pain expressions was presented by Pessanha et al. [41]. The presented pipeline automatically determines the quantitative pose of the equine head and localizes facial landmarks, based on which classification is made. The manual scoring of pain was performed using the Equine Utrecht University Scale for Automated Recognition in Facial Assessment of Pain (EQUUS-ARFAP) [42]. This scale has not been validated; moreover, significant disagreement between scorers was reported. Pain prediction is done for each region of interest separately; models for some regions had good performance in binary classification of pain/no pain (orbital tightening had an F1 score of 0.86, ears 0.72), while the majority had lower performance. Hummel et al. [40] also focused on facial expressions for pain recognition in horses, presenting a hierarchical system for pose-specific automatic pain prediction on horse faces, and exploring its extension to donkeys. While F1 scores of 0.51-0.88 were achieved in pain recognition in horses, the transfer to donkeys was difficult. Another work on horse pain by Broome et al. addressed the whole body of the horse and used more sophisticated methods taking videos as input [39]. Follow-up studies addressed transfer from acute to low-grade orthopedic pain in horses [43], as well as semi-supervised approaches with video [44].
To the best of our knowledge, only one study has so far addressed emotional state recognition in horses. Corujo et al. [45] addressed states including "alarmed", "annoyed", "curious", and "relaxed", defining each of them in terms of eye, ear, nose and neck behavior. However, these definitions are neither objective nor operationally defined, and so are open to observer interpretation (using descriptions such as 'relaxed'), leading to low reliability of ground truth annotation.
The study presented here is the first to explore automated recognition of horse emotional states from facial expressions, using a dataset collected with the carefully designed experimental protocol of Ricci-Bonot and Mills [30]. In the protocol, similar to the one developed in [46] for dogs, the context defines the emotional states of the horses, which were tested in three different scenarios involving the potential availability of food: anticipation of a reward, considered a positive emotional state; and frustration at waiting for a reward and disappointment at the loss of the reward, both considered negative emotional states. Tests were conducted in a stable with a feeding device fixed outside the stable within reach of the horse. Analysis of video recordings of the horses' facial expressions was undertaken using the Horse Facial Action Coding System (EquiFACS), an objective system for coding facial movements on the basis of the contraction of underlying muscles, as well as their behaviors. This dataset creates a unique experimental environment for exploring different machine learning approaches in the context of emotion recognition. Specifically, we explore two routes to automated emotion recognition. The first approach uses deep learning, taking videos as input, analyzing them frame by frame and then aggregating the frames for an emotional state prediction. The second approach takes the EquiFACS coding of the video as input and uses machine learning to predict an emotional state.

Dataset
The dataset used in this study was collected as part of a previous study by Ricci-Bonot and Mills [30]. The delegated authority of the University of Lincoln Research Ethics Committee approved this research (UoL2021_6910), and all methods were carried out in accordance with the University Research Ethics Policy and the ethical guidelines of ISAE [47]. Written informed consent was obtained from the owner of all horses used in the research. No further ethical approval was required for the current in silico work. All experiments were performed in accordance with relevant guidelines and regulations. The study is reported in accordance with ARRIVE guidelines.
Videos were obtained from the 31 horses involved in the experiment conducted by Ricci-Bonot and Mills [30]. The horses belonged to different breeds, including Cob Normand, French Saddle, Haflinger, Hungarian, and Pinto cross Trotter, with some of unknown breed. The age range of the horses was 2 to 23 years, with an average age of 11.5 years and a standard deviation of 6.6. The sample included 1 entire male, 10 geldings, and 20 females. One horse failed the training phase for food anticipation and all its videos were consequently excluded, leaving data from 30 horses.
The dataset included 296 video samples, each 3 seconds long, recorded at a frame rate of 60 frames per second with a frame resolution of 1920x1080 pixels.
Tests were conducted in a stable with a feeding device fixed outside the stable within reach of the horse, using the protocol fully described in Ricci-Bonot and Mills [30]. Each subject was tested and recorded once in the baseline condition and three times in each of the anticipation, frustration and disappointment conditions, resulting in 87 recordings of anticipation states, 30 recordings of baseline states, 90 recordings of disappointment states and 89 recordings of frustration states. Some videos could not be used due to insufficient visibility.
All the video samples were coded by a certified EquiFACS coder (C.R.B.) based on the EquiFACS manual. All action units, action descriptors and other variables were coded as present or absent. To ensure the reliability of the coding, a second certified EquiFACS coder (N.J.) coded more than 10% of the video samples. For the analysis, only EquiFACS variables shown to be reliable by the second coder and occurring in more than 10% of at least one of the four situations (baseline, anticipation, frustration and disappointment) were considered. Table 1 presents the EquiFACS variables that were eventually used in the analysis.

AI pipelines overview
For narrative purposes, we preface our results with a high-level overview of the approaches used, covering essential practical aspects to improve understanding for those less familiar with AI methods.
We compare the emotion classification performance of two different pipelines using two different types of input. The first pipeline takes 3-second-long video recordings as input; the second takes the EquiFACS coding information as input.

Video classification pipeline
The pipeline for video classification used in this study follows the approach presented in [48], making sophisticated use of the availability of video data in two ways: we integrate temporal information by using the Grayscale Short-Term stacking (GrayST) method [49] to encode movement between consecutive frames into a single frame, and we apply a frame selection technique to better exploit the availability of video data and improve performance.
The input to the model is video. To remove background information, we crop the horse faces using the YOLOv5 object detection model [50]. We then apply the GrayST method to incorporate temporal information for video classification without increasing the computational burden. This sampling strategy involves substituting the conventional three color channels with three grayscale frames obtained from three consecutive time steps. Consequently, the backbone network can capture short-term temporal dependencies while sacrificing the capability to analyze color. The next stage involves encoding each image into a 768-dimensional embedding vector employing a Vision Transformer (ViT [51]) trained in a self-supervised manner using DINO [52], with a batch size of 8. We extract the output of the final layer as a 768-dimensional embedding vector that is used for emotion classification. The embedding vectors are then fed to SVM models in a two-stage approach: a first SVM model is trained on all sampled frames, and its confidence levels (how confident the model is of its classification of a frame, computed as the predicted class probabilities) are used to choose the top frames for each emotional class; a second SVM model is then trained using only these highest-confidence frames. Fig 1 shows a high-level overview of the pipeline.
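To make the flow of the video pipeline concrete, the following is a minimal sketch of its three core steps: GrayST stacking of consecutive grayscale frames, DINO ViT embedding extraction, and the two-stage SVM with confidence-based frame selection. The YOLOv5 face cropping is assumed to have been applied already; the function names, resizing and normalization details, and the top-k value are illustrative assumptions, not the exact implementation used in the study.

```python
import numpy as np
import torch
import cv2
from sklearn.svm import SVC

# DINO-pretrained ViT-B/16 backbone; its forward pass returns a 768-dim embedding.
vit = torch.hub.load("facebookresearch/dino:main", "dino_vitb16")
vit.eval()

def grayst_stack(frames):
    """GrayST: replace the three color channels with three consecutive grayscale frames."""
    grays = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in frames[:3]]
    return np.stack(grays, axis=-1)                            # H x W x 3, encodes short-term motion

@torch.no_grad()
def embed(stacked):
    """Resize, scale to [0, 1], normalize, and extract a 768-dim ViT embedding."""
    img = cv2.resize(stacked, (224, 224)).astype(np.float32) / 255.0
    x = torch.from_numpy(img).permute(2, 0, 1).unsqueeze(0)    # 1 x 3 x 224 x 224
    x = (x - 0.5) / 0.5                                        # illustrative normalization
    return vit(x).squeeze(0).numpy()                           # shape (768,)

def two_stage_svm(X, y, top_k=50):
    """X: (n_frames, 768) embeddings, y: (n_frames,) labels.
    Stage 1: SVM on all frames; stage 2: SVM retrained on the most confident frames per class."""
    svm1 = SVC(probability=True).fit(X, y)
    conf = svm1.predict_proba(X).max(axis=1)                   # per-frame confidence
    keep = []
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        keep.extend(idx[np.argsort(conf[idx])[::-1][:top_k]])  # top-k frames of class c
    keep = np.array(keep)
    return SVC(probability=True).fit(X[keep], y[keep])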

EquiFACS classification pipeline
The EquiFACS data table for classification contains 296 rows (one for each video) and 25 columns: horse subject Id, emotional state, and the presence or absence of the 23 different EquiFACS codes described above. The presence (absence) of a certain action unit X in a given video Y was marked as 1 (0) in the column of X and the row of video Y. Where the information was marked as not available, such entries were filled with 0.5. This data is then fed into a Decision Tree classifier. Fig 1 shows a high-level overview of the pipeline.
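As a rough illustration, a sketch of this pipeline could look as follows; the file name, column names and tree depth are assumptions rather than the exact configuration used in the study.

```python
# Sketch of the EquiFACS pipeline: per-video action-unit presence codes fed to a Decision Tree.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

df = pd.read_csv("equifacs_codes.csv")                  # hypothetical file: one row per 3-second video
X = df.drop(columns=["horse_id", "state"]).fillna(0.5)  # 23 presence/absence columns; unavailable codings -> 0.5
y = df["state"]                                         # baseline / anticipation / frustration / disappointment

tree = DecisionTreeClassifier(max_depth=4, random_state=0)   # a shallow tree keeps the if-then rules readable
tree.fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))      # inspect the learned rules
```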

Model performance
For measuring the performance of the models, we use the standard evaluation metrics of accuracy, precision, recall and F1 (see, e.g., [37,38] for further details). As a validation method [53], we use leave-one-subject-out cross-validation with no subject overlap. Due to the relatively low number of horses (n = 30) in the dataset, following this stricter method is more appropriate [35,39]. In our case this means that we repeatedly train on 29 subjects and test on the remaining subject. By separating the subjects used for training, validation and testing, we enforce generalization to unseen subjects and ensure that no features specific to an individual are used for classification.
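A minimal sketch of this evaluation scheme follows, assuming per-video features X, labels y, and a groups vector of horse Ids as in the sketches above; the Decision Tree is used here only as a placeholder estimator.

```python
# Leave-one-subject-out cross-validation: each fold trains on 29 horses and tests on the held-out horse.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, cross_val_predict
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.tree import DecisionTreeClassifier

def evaluate_loso(X, y, groups):
    """X: per-video features, y: emotional-state labels, groups: horse Id of each video."""
    clf = DecisionTreeClassifier(random_state=0)
    y_pred = cross_val_predict(clf, X, y, groups=groups, cv=LeaveOneGroupOut())
    acc = accuracy_score(y, y_pred)
    prec, rec, f1, _ = precision_recall_fscore_support(y, y_pred, average="macro")
    return acc, prec, rec, f1
```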

Results
Table 2 presents our main results: the performance comparison between the video-based pipeline and the EquiFACS-based pipeline. The video-based pipeline outperforms the EquiFACS-based one, reaching 76% accuracy in separating all four classes, as opposed to only 69% for the latter. It should be noted that this performance was reached in a two-phase process, described in Table 3. The advantage of the lower-performing EquiFACS-based classifier is, however, its explainability in the form of a decision tree. Confusion matrices for the video-based and EquiFACS-based pipelines can be found in Figs 3 and 4 respectively. It can be seen that separation between Anticipation and Frustration is difficult for both models. Table 2 therefore also presents classification performance for three states, with Anticipation and Frustration treated as one state, which greatly increases performance. The separation between the two 'difficult' states of Anticipation and Frustration reaches 61% accuracy with the video-based model, but only 46% with the EquiFACS-based model.

Discussion
The present study is the first to explore automated recognition of horse emotional states focusing on diverse facial expressions, based on a carefully designed controlled experimental setup for dataset creation and annotation. While facial muscular tone may decline with age [54] and facial morphology may vary with factors like sex and breed, central to the idea of emotional expression is that reliable changes can be predicted regardless of these factors. Thus, although a factor like eye wrinkling might change with sex [55] and even be related to emotion, it cannot be used as a reliable general marker of emotion in horses precisely because of this variation. Since we are interested in generic markers, we do not attempt to model the effect of factors such as breed, sex or age in our models.
We presented classifier pipelines of two different types: deep learning video-based and EquiFACS-based. The former reaches 76% accuracy in separating the four emotional states, while the latter has lower performance (69%). This could be an indication that EquiFACS contains less information than raw video, and that there are subtle nuances not captured by the EquiFACS annotation system. This is further strengthened by the fact that the deep learning classifier also outperforms the EquiFACS-based one in separating the two "difficult" cases of Anticipation vs. Frustration, reaching 61% accuracy.
An EquiFACS-based approach was used in the original study [30], which involved observer-based coding, and this approach, even in an automated process, has one crucial benefit: explainability. As discussed in [37,56], deep learning models have a 'black-box' nature, and it is important to understand how machines classify emotional states, exploring explainability (what is the rationale behind the machine's decision?) and interpretability (how is the model structure related to making such a decision?) [57]. These topics are fundamental in AI and are addressed by a large body of research [58,59].
The EquiFACS-based Decision Tree presented in this study allows us to answer such questions by observing the tree structure, represented by 'if-then' rules. From the tree, one can infer, e.g., that the machine chooses "Baseline" when none of the action units AD19-Tongue_show, AD51-Head_turn_left or AD52-Head_turn_right are present (see the leftmost branch of the tree). When AD19-Tongue_show is present, whether or not the lower face part is visible (i.e., regardless of the VC72-Lower_face_not_visible indicator), the machine chooses "Disappointment" (see the rightmost branch of the tree). Otherwise, "Anticipation or Frustration" is derived. This suggests that the system is using not only facial expression but also the wider movement of the head as part of the classification process. The risk of this being artefactual, arising from the design of the study (based on a food delivery system), needs to be carefully considered, and thus generalizations about emotional state should be made with care. A similar phenomenon could arise within the generally superior deep learning video-based approach, but we have no way of knowing this. Thus, replication studies examining these emotions in horses in other contexts are essential and will strengthen the database used for deriving solid conclusions.
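To make these rules more tangible, the top-level decision logic described above can be restated as simple if-then code; this is a simplification of the learned tree, and the dictionary keys are illustrative names for the EquiFACS codes mentioned in the text.

```python
# Simplified restatement of the Decision Tree's top-level rules as if-then logic.
def classify(au):  # au: dict mapping EquiFACS code -> presence (True/False)
    if not (au["AD19_tongue_show"] or au["AD51_head_turn_left"] or au["AD52_head_turn_right"]):
        return "Baseline"                      # leftmost branch of the tree
    if au["AD19_tongue_show"]:                 # regardless of VC72 (lower face not visible)
        return "Disappointment"                # rightmost branch of the tree
    return "Anticipation or Frustration"
```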
For a possible explanation of why the states of anticipation and frustration could not be well separated in our study, despite having a stronger separation in the study of dogs [46], we refer the reader to Ricci-Bonot and Mills [30]. Whilst the authors considered it possible that there is a lack of facial differences between positive anticipation and frustration in horses, or that feeding in the context of the experiment is a largely frustrating event, it was also thought that it might be an artefact of using a 1-0 sampling method, which meant that the detail within a video may not have been captured [30]. The fact that the video-based deep learning model achieves 61% accuracy in this case indicates that some visual signal is there, and the explainability of this classifier should be explored in future research, with future experimental protocols designed to better separate these two emotional states.
Fig 1 presents the two pipelines. S1 Appendix presents further technical details on the pipeline.
Fig 2 displays examples of cropped horse faces (top) and GrayST stacked frames (bottom) for each of the emotional states, from left to right: 'Anticipation', 'Baseline', 'Disappointment', 'Frustration'. Each bottom frame stacks three consecutive frames: in the 'Baseline' case no movement of the horse is visible, while in the other three cases some movement is captured.